fix: flaky streaming e2e #1332

Merged
EItanya merged 2 commits into kagent-dev:main from
supreme-gg-gg:jetc/fix/e2e-test
Feb 19, 2026
Conversation

@supreme-gg-gg
Contributor

@supreme-gg-gg supreme-gg-gg commented Feb 18, 2026

Sharing one timeout context across all streaming retry attempts was causing failures, because later attempts inherited whatever time earlier attempts had left. Each retry now gets a fresh context. The "agent not found" error was caused by deleting agents twice; that was not the root cause of the flake, but it is fixed here as well. Closes #1328.

Running go test -v github.com/kagent-dev/kagent/go/test/e2e -failfast -count=10 -run TestE2EInvokeInlineAgentWithStreaming did not fail (it previously did).

Copilot AI review requested due to automatic review settings February 18, 2026 22:32
@supreme-gg-gg supreme-gg-gg changed the title from "fix e2e" to "fix: flaky streaming e2e" Feb 18, 2026
Contributor

Copilot AI left a comment

Pull request overview

This pull request refactors the retry logic in the runStreamingTest function to implement a manual per-attempt timeout pattern, replacing the previous shared timeout approach. The changes also remove redundant cleanup code from TestE2EInvokeInlineAgentWithStreaming.

Changes:

  • Replaced retry.OnError with a manual retry loop using exponential backoff for streaming tests
  • Changed from a single 30-second timeout across all attempts to a 10-second timeout per attempt
  • Removed redundant deferred cleanup in TestE2EInvokeInlineAgentWithStreaming (cleanup is already handled via t.Cleanup in setup functions)

Comment on lines 264 to 284
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	stream, err := a2aClient.StreamMessage(ctx, protocol.SendMessageParams{Message: msg})
	if err != nil {
		cancel()
		lastErr = err
		continue
	}

	resultList := []protocol.StreamingMessageEvent{}
	var text string
	for event := range stream {
		msgResult, ok := event.Result.(*protocol.TaskStatusUpdateEvent)
		if !ok {
			continue
		}
		if msgResult.Status.Message != nil {
			text += a2a.ExtractText(*msgResult.Status.Message)
		}
		resultList = append(resultList, event)
	}
	cancel()

Copilot AI Feb 18, 2026

Context cancellation timing issue: The context is cancelled immediately after consuming the stream channel (line 284), but it should be deferred to ensure it's cancelled even if an error occurs during stream consumption. The current pattern doesn't follow Go best practices for context cancellation. Move the cancel() call to a defer statement immediately after line 264 to ensure proper cleanup. This is the pattern used elsewhere in the codebase (e.g., go/cli/internal/cli/agent/invoke.go:102-103).

Comment on lines 259 to 261
backoff := defaultRetry.Duration * time.Duration(1<<(attempt-1))
t.Logf("streaming attempt %d failed, retrying in %v: %v", attempt, backoff, lastErr)
time.Sleep(backoff)

Copilot AI Feb 18, 2026

The manual backoff calculation defaultRetry.Duration * time.Duration(1<<(attempt-1)) is a simple exponential backoff but doesn't use the Factor and Jitter fields from defaultRetry (lines 181-186). The defaultRetry config specifies Factor: 2.0 and Jitter: 0.2, but this implementation ignores both. Either use these fields in the calculation (backoff = Duration * Factor^(attempt-1) with jitter), or consider using wait.ExponentialBackoff or wait.ExponentialBackoffWithContext which properly implements the backoff strategy including jitter.

ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)

Copilot AI Feb 18, 2026

The per-attempt timeout has been reduced from 30 seconds (original shared timeout across all attempts) to 10 seconds per attempt. While this provides a clearer timeout boundary per attempt, verify that 10 seconds is sufficient for streaming operations in your test environment. If the streaming response legitimately takes longer than 10 seconds, this could cause flaky test failures.

Suggested change
- ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
+ ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)

Signed-off-by: Jet Chiang <pokyuen.jetchiang-ext@solo.io>
Signed-off-by: Jet Chiang <pokyuen.jetchiang-ext@solo.io>
@EItanya EItanya merged commit 804f199 into kagent-dev:main Feb 19, 2026
17 checks passed
Development

Successfully merging this pull request may close these issues.

[BUG] e2e test flake, streaming agent

2 participants